Don Merrion of the Division of Local Government in the Colorado Department of Local Affairs was kind enough to compile and share data capturing county and local finances over the 1975-2009 period. He actually gave this to me months ago, but World Bank work (pretty tough to be dismayed by that) and frequent travel have hindered my exploration of it. At long last, I am going to take a look at what I have here. The data reside in /home/choct155/Google Drive/Dissertation/Data/.
pwd
ls
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pandas.io.data as web
import matplotlib.pyplot as plt
from IPython.display import HTML
#Set print width
pd.set_option('line_width',140)
#Set plot style
pd.options.display.mpl_style='default'
#Establish working directory
workdir='/home/choct155/Google Drive/Dissertation/Data/'
#Read in data
county_ent=pd.read_csv(workdir+'COUNTY_ENT_1975_TO_2009.csv')
county_gg=pd.read_csv(workdir+'COUNTY_GG_1975_TO_2009.csv')
muni_ent=pd.read_csv(workdir+'MUNY_ENT_1975_to_2009.csv')
muni_gg=pd.read_csv(workdir+'MUNY_GG_1975_to_2009.csv')
print county_ent
print county_gg
print muni_ent
print muni_gg
It looks like general fund data are in the 'gg' files, so let's start there. There are fewer county records, so that can be the guinea pig. The DF is too wide to display, so we will display the head() in chunks.
for i in range(5):
    print county_gg.head(5)[county_gg.columns[i*10:(i+1)*10]].to_string()
print county_gg.head(5)[county_gg.columns[50:]].to_string()
Ok, I am assuming that LG_ID is a unique identifier for each jurisdiction. Lucky for me, I have a handy function to make this easy...
def one2one(seq1,seq2):
    seq_dict=dict(zip(seq1,seq2))
    values=[]
    for key in seq_dict:
        values.append(seq_dict[key])
    print len(seq_dict.keys()),'|',len(set(values))
    return len(seq_dict.keys())==len(set(values))
one2one(county_gg['LG_ID'],county_gg['NAME'])
So, they are unique identifiers, which means either can be used in the index. Since LG_ID doesn't provide additional information, no use carrying dead weight.
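Since this helper will get reused, here is a quick sanity check of its logic on toy data (the IDs and names below are made up, not drawn from the county file). A mapping is one-to-one when the number of distinct keys equals the number of distinct values:

```python
# Self-contained version of the uniqueness check (return value only, no printing)
def check_one2one(seq1, seq2):
    seq_dict = dict(zip(seq1, seq2))          # duplicate keys collapse here
    values = list(seq_dict.values())
    return len(seq_dict) == len(set(values))  # distinct keys vs distinct values

# Hypothetical jurisdiction IDs and names
ids = [101, 102, 103]
names = ['Alpha County', 'Beta County', 'Gamma County']
dup_names = ['Alpha County', 'Alpha County', 'Gamma County']

check_one2one(ids, names)      # True: each ID maps to a distinct name
check_one2one(ids, dup_names)  # False: two IDs share a name
```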
#Drop county ID
county_gg.pop('LG_ID')
I am also going to throw year and jurisdictional type into the index, but first I need to look at how many jurisdictional types we have. It's probably also worth checking the completeness of years.
#Count jurisdictions
print county_gg['LGTYPE_ID'].value_counts()
#Do all years have the same number of obs?
plt.rcParams['figure.figsize']=10,3
print county_gg['AUDIT_YEAR'].value_counts().sort_index().plot(kind='bar',title='Completeness by year')
#Set index
c_gg=county_gg.set_index(['NAME','AUDIT_YEAR','LGTYPE_ID'])
So we have full coverage on the years, and we have discovered two additional jurisdictional types. I assume type 1 is a general purpose jurisdiction, while the others (types 61 and 70) represent special purpose districts of some sort. If we can't tell from the names, we will have to check this out a bit later.
To check these items, we will examine a cross-section by LGTYPE_ID and pull the NAME attribute from the index. We really only care about unique names, so set() again comes into play.
print set(c_gg.xs(61,level='LGTYPE_ID').index.get_level_values(level='NAME'))
print set(c_gg.xs(70,level='LGTYPE_ID').index.get_level_values(level='NAME'))
Both 61 and 70 identify singular entities. The odd thing is that both are City/County combinations, so I am unclear on why they carry different LGTYPE_ID values.
In any event, just to provide an easy reference, here are the rest of the jurisdictions.
print set(c_gg.xs(1,level='LGTYPE_ID').index.get_level_values(level='NAME'))
We have too many columns to deal with at the current time, and it appears that they are hierarchical. Fortunately, the Division of Local Government in CO has provided data summaries by year that reveal the nesting behavior. Consequently, we can reduce these to some extent.
HTML('<iframe src=http://www.colorado.gov/cs/Satellite?blobcol=urldata&blobheadername1=Content-Disposition&blobheadername2='+
'Content-Type&blobheadervalue1=inline%3B+filename%3D%222009+County+Compendium+Information.pdf%22&blobheadervalue2=application%2Fpdf&blobkey='+
'id&blobtable=MungoBlobs&blobwhere=1251848971158&ssbinary=true width=1000 height=950></iframe>')
#for coll in c_gg.columns:
# print coll
#Construct hierarchical dictionaries
c_rev_dict={'REV_TOTAL':['REV_TOTAL_TAX','REV_LICENSES','REV_INTGOVT','REV_CHARGES','REV_FINES','REV_MISC','REV_TRANSFER_IN'],
'REV_TOTAL_TAX':['REV_PROPERTY_TAX','REV_SO_TAX','REV_SALES_USE_TAX','REV_FRANCHISE_TAX','REV_OCCUPATION_TAX','REV_OTHER_TAX','REV_UNCLASS_TAX'],
'REV_INTGOVT':['REV_HUT','REV_CIGARETTE_TAX','REV_MOTOR_VEH_FEE','REV_CTF','REV_SOCIAL_SERVICE','ALL_OTHER_INTGOVT']}
c_exp_dict={'EXP_TOTAL':['EXP_TOTAL_OPERATING','EXP_TRANSFER_OUT','EXP_CAPITAL_OUTLAY','EXP_DEBT_SERVICE_GEN'],
'EXP_TOTAL_OPERATING':['EXP_GEN_GOVT','EXP_JUDICIAL','EXP_TOTAL_PUBLIC_SAFETY','EXP_TOTAL_PUBLIC_WORKS','EXP_HEALTH','EXP_RECREATION','EXP_SOCIAL_SERVICE','EXP_MISC'],
'EXP_TOTAL_PUBLIC_SAFETY':['EXP_POLICE','EXP_FIRE','EXP_OTHER_PUBLIC_SAFETY'],
'EXP_TOTAL_PUBLIC_WORKS':['EXP_STREET','EXP_TRASH','EXP_OTHER_PUBLIC_WORKS'],
'EXP_DEBT_SERVICE_GEN':['EXP_PRINCIPAL_GEN','EXP_INTEREST_GEN']}
c_debt_dict={'TOTAL_DEBT':['GO_DEBT_GEN','REVENUE_DEBT_GEN','OTHER_DEBT_GEN']}
There are a few things to note here:
1. TOTAL_DEBT does not exist in the current set, so we will create it by summing the other outstanding debt variables (those found in the values array of c_debt_dict);
2. The data include information about sales tax rate, retail sales activity, and state sales tax collected. We will leave this alone for the time being; and,
3. The data include population information, which will be retained to provide per capita perspectives.
To test the validity of our nesting scheme, the first thing we should do is subset to a single jurisdiction and see if our values add up. Adams County should work as well as any other.
The validation check loops through each key of each dictionary created. It then identifies the 'total' column using the key, and the 'component' columns using the values associated with said key. A boolean check for equality of the total and sum of the components is deposited into a Series object (called 'same' below). This happens because the high-level boolean check is actually a vectorized operation. The check is thus performed for each year in the 'total' and 'component' Series (which I think need to be of the same length). If the check encounters an inequality, it prints False, the index position, the corresponding component values, and the difference between the given 'total' value and the value calculated as the sum of the 'components'. It is important to capture both the component values and the difference because if the difference corresponds to one of the component values, we know which one shouldn't be there.
#Create TOTAL_DEBT
c_gg['TOTAL_DEBT']=c_gg[['GO_DEBT_GEN','REVENUE_DEBT_GEN','OTHER_DEBT_GEN']].sum(axis=1)
#Generate Adams County subset
adams=c_gg.ix['Adams County']
for dct in [c_rev_dict,c_exp_dict,c_debt_dict]:
    for key in dct:
        same=adams[key]==adams[dct[key]].sum(axis=1)
        for i in range(len(same)):
            if same[i]==False:
                print 'KEY (',key,'): THE FOLLOWING COMPONENTS ARE NOT FULLY SPECIFIED\n'
                print same[i], same.index[i]
                print adams[dct[key]][i:i+1],'\n'
                print 'DIFFERENCE >>> ',adams[key][i:i+1]-adams[dct[key]].sum(axis=1)[i:i+1]
            else:
                pass
This ended up being an important check because I did add in something that shouldn't be there. For EXP_TOTAL, I was unsure of what to do with EXP_DEBT_SERVICE_GEN. It appears in the input data, but it does not appear in the data summaries mentioned above (on the website). I figured it should just be grouped with principal (EXP_PRINCIPAL_GEN) and interest (EXP_INTEREST_GEN) payments, but that is not the case. As it turns out, EXP_DEBT_SERVICE_GEN is itself a summary wrapper for EXP_PRINCIPAL_GEN and EXP_INTEREST_GEN, and thus warranted its own key with these items as components.
In any case, we are all good now and can explore the data a bit. We can take a look at the very wide angle view capturing total revenues over time, but first we should deflate the values. We can pull CPI information from the FRED database.
#Open feed to CPI info
cpi=web.get_data_fred('USACPIBLS','1/1/1975','1/1/2009')
#Create copy in case offline work is occurring
cpi.to_csv(workdir+'us_cpi_1975_2009.csv')
#Create common index for later join with data
cpi['AUDIT_YEAR']=range(1975,2010)
cpi2=cpi.reset_index().set_index('AUDIT_YEAR')
#Rename columns
cpi2.columns=['DATE','CPI']
#Drop DATE
cpi2.pop('DATE')
#Calculate deflator ratios
dfl=cpi2.div(cpi2.ix[2009])
dfl.head()
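The deflator arithmetic is easy to verify on toy numbers (illustrative CPI values, not actual BLS figures): with the 2009 level as the base, an early-year ratio of 0.5 means a nominal dollar from that year is worth two 2009 dollars.

```python
import pandas as pd

# Toy CPI series indexed by year (values are illustrative only)
toy_cpi = pd.Series([50.0, 75.0, 100.0], index=[1975, 1990, 2009], name='CPI')

# Deflator ratios relative to the 2009 level, mirroring cpi2.div(cpi2.ix[2009])
toy_dfl = toy_cpi / toy_cpi.loc[2009]   # 1975 -> 0.5, 1990 -> 0.75, 2009 -> 1.0

# Deflating a nominal 1975 value of $10 yields $20 in 2009 dollars
real_1975 = 10.0 / toy_dfl.loc[1975]    # 20.0
```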
Now we can join with the county set for deflation. Notice the technique here: it's basically an implicit join. In fact, this method renders irrelevant the name matching I did above. This is a much faster way of matching on a MultiIndex than I have attempted in the past. Longer head() snapshots reveal a correct merge.
#Implicit merge
c_gg['DFL']=dfl.ix[c_gg.index.get_level_values(level='AUDIT_YEAR')].values
print c_gg[['REV_TOTAL','DFL']].head(5)
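The same trick on toy data, with made-up names standing in for the county frame: pull the year level out of the MultiIndex, use it to index the lookup table, and attach the raw values.

```python
import pandas as pd

# Panel-style frame indexed by (NAME, AUDIT_YEAR); values are made up
idx = pd.MultiIndex.from_tuples(
    [('A County', 1975), ('A County', 1976), ('B County', 1975)],
    names=['NAME', 'AUDIT_YEAR'])
panel = pd.DataFrame({'REV_TOTAL': [100.0, 110.0, 50.0]}, index=idx)

# Lookup table keyed by year only
lookup = pd.DataFrame({'DFL': [0.5, 0.6]},
                      index=pd.Index([1975, 1976], name='AUDIT_YEAR'))

# Implicit join: index the lookup table with the year level of the MultiIndex
panel['DFL'] = lookup['DFL'].loc[panel.index.get_level_values('AUDIT_YEAR')].values
# panel['DFL'] is now [0.5, 0.6, 0.5], aligned row by row
```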
Now we need to broadcast the DFL across columns.
#Divide each column by deflator ratio
cgg_real=c_gg.div(c_gg['DFL'],axis=0)
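On a toy frame, div with axis=0 divides every column elementwise by the given Series; note that it also divides DFL by itself, leaving a column of ones.

```python
import pandas as pd

# Nominal values plus a deflator column (illustrative numbers)
toy = pd.DataFrame({'REV_TOTAL': [100.0, 200.0],
                    'EXP_TOTAL': [80.0, 150.0],
                    'DFL': [0.5, 1.0]})

# Broadcast the deflator across all columns, row by row
toy_real = toy.div(toy['DFL'], axis=0)
# REV_TOTAL -> [200.0, 200.0], EXP_TOTAL -> [160.0, 150.0], DFL -> [1.0, 1.0]
```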
First a few summary plots looking at the trend in revenue from all counties over time.
#Group by year
cgg_yr=cgg_real.groupby(level='AUDIT_YEAR')
#Plot total, avg, and median revenues over time
plt.rcParams['figure.figsize']=20,10
fig,axes=plt.subplots(1,2)
cgg_yr['REV_TOTAL'].sum().plot(kind='line',ax=axes[0],linewidth=3,title='Total Revenues from All Counties in Colorado')
cgg_yr['REV_TOTAL'].mean().plot(kind='line',ax=axes[1],linewidth=3,label='AVG')
cgg_yr['REV_TOTAL'].median().plot(kind='line',ax=axes[1],linestyle='--',linewidth=3,title='Average and Median Revenues from All Counties in Colorado',label='MED')
#Generate a box plot of REV_TOTAL
cgg_real.reset_index('AUDIT_YEAR').boxplot(column='REV_TOTAL',by='AUDIT_YEAR')
Note that while revenues are growing over time in real terms (even for the median jurisdiction), the spread between average and median revenues is growing. This is reaffirmed by the expanding inter-quartile spread in the boxplot, and can have implications for exacerbating the impact of tax and expenditure limitations. In other words, it looks like it will be easier to tease out categorical differences between jurisdictions over time.
It would be interesting to know how the shape of the distribution has changed over time.
#Create cross-sections starting in 1979, proceeding in decade increments to 2009, and plot them
plt.rcParams['figure.figsize']=20,10
for i in range(4):
    c_gg_cross=cgg_real.xs((1979+i*10),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',label=str((1979+i*10)))
plt.xlim([-500000000,1500000000])
plt.legend(loc='best')
In real terms, the distribution of revenues is flattening over time. The peaks are shifting rightward, and the non-overlapping area relative to previous years is larger on the positive side. This suggests that even if the absolute spread between median and average jurisdictions is growing, the relative spread may be moving in the other direction.
A more continuous display may be useful...
#Create cross-sections for each year from 1979 through 2009 and plot them
plt.rcParams['figure.figsize']=20,10
for i in range(31):
    c_gg_cross=cgg_real.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',color=[(i/float(31)),((i+1)/float(32)),((i+2)/float(33))],label=str((1979+i)))
plt.xlim([-500000000,1500000000])
plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)
plt.figure()
for i in range(31):
    c_gg_cross=cgg_real.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',color=(((31-i)/float(31)),0.,(i/float(31))),label=str((1979+i)))
plt.xlim([-500000000,1500000000])
plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)
To get a continuous color distribution, we had to utilize RGB sequences. Each sequence is a tuple of three values on the interval [0-1]. To get subtle changes across each year, we employed the iterator to make small changes in one or more of the three sequence positions. More information on the matplotlib (the default graphing base for pandas) treatment of colors can be found here.
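The ramp can also be built up front as a list of RGB tuples; this sketch interpolates from pure red to pure blue over 31 steps (normalizing by n-1 so the endpoints are hit exactly, a slight variation on the inline arithmetic above).

```python
# Red-to-blue ramp: one RGB tuple per year from 1979 through 2009
n = 31
ramp = [(((n - 1) - i) / float(n - 1), 0.0, i / float(n - 1)) for i in range(n)]
# ramp[0] is (1.0, 0.0, 0.0) and ramp[-1] is (0.0, 0.0, 1.0)
```

These tuples can then be passed one at a time via the color= argument inside the plotting loop.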
It would be interesting to see if expenditures follow a similar pattern. The next plot uses only three time periods and overlays expenditure distribution with dotted lines.
plt.rcParams['figure.figsize']=20,10
colors=['r','b','g']
for i in range(3):
    c_gg_cross=cgg_real.xs((1979+i*15),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Revenue Shape Over Time',color=colors[i],label=str(1979+i*15)+'REV')
    c_gg_cross['EXP_TOTAL'].plot(kind='kde',linewidth=3,linestyle='--',title='Evolution of County Revenue/Expenditure Shape Over Time',color=colors[i],label=str(1979+i*15)+'EXP')
plt.xlim([-500000000,1500000000])
plt.legend(loc='best')
There is a noticeable consistency in pattern, but the disparity between revenues and expenditures in 2009 is dramatic. This is likely a consequence of the bursting housing bubble. Sure would be nice to have another four years of data. Perhaps that can still be acquired.
Ok, let's get into the compositional aspects. For the time being, we are still using the set aggregated by year and we will look at revenues and then expenditures. For each we will consider compositional shifts, absolute growth, and indexed growth. (Debt dynamics will be assessed when actively required.)
plt.rcParams['figure.figsize']=25,20
fig,axes=plt.subplots(3)
i=0
for key in c_rev_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_rev_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=cols)
    sub_df[c_rev_dict[key]].div(sub_df[key],axis=0).plot(kind='bar',stacked=True,ax=axes[i],title='Composition of '+key+' Revenue over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_rev_dict[key])+1)
    i+=1
Proportional growth...
fig,axes=plt.subplots(3)
i=0
for key in c_rev_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_rev_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=cols)
    sub_df[c_rev_dict[key]].div(sub_df[key],axis=0).plot(kind='line',ax=axes[i],linewidth=4,title='Proportional Growth of '+key+' Component Revenue over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_rev_dict[key])+1)
    i+=1
Index Growth...
fig,axes=plt.subplots(3)
i=0
for key in c_rev_dict.keys():
    #Subset to columns of interest
    cols=c_rev_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=cols)
    DataFrame((sub_df/sub_df.ix[1975])*100).plot(kind='line',ax=axes[i],linewidth=5,title='Indexed Absolute Growth of '+key+' Revenue over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_rev_dict[key])+1)
    i+=1
While property tax is clearly the dominant factor in county finances, there are some other interesting things going on. For one, the fastest growing revenue source over the period has been transfers from enterprises. It remains a relatively small factor, but its growth is undeniable. Couple this with the marked increase in franchise taxes relative to other tax collections, and it suggests a greater level of business activity.
Secondly, sales and use tax made an aggressive push to approach property tax for dominance in the late 1990s, but there was a sharp deflection in 2000. This compositional action was driven almost entirely by sales and use growth rather than a property decline.
It is also interesting to note that most of the intergovernmental sources of revenue experienced flat or declining growth over the period. From a compositional standpoint, intergovernmental revenue as a whole declined in importance over the period, and it appears tax collections picked up the slack.
Replicating these views for expenditures...
plt.rcParams['figure.figsize']=25,25
fig,axes=plt.subplots(5)
i=0
for key in c_exp_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_exp_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=cols)
    sub_df[c_exp_dict[key]].div(sub_df[key],axis=0).plot(kind='bar',stacked=True,ax=axes[i],title='Composition of '+key+' Spending over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_exp_dict[key])+1)
    i+=1
fig,axes=plt.subplots(5)
i=0
for key in c_exp_dict.keys():
    #Subset to columns of interest
    cols=[key]+c_exp_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=cols)
    sub_df[c_exp_dict[key]].div(sub_df[key],axis=0).plot(kind='line',ax=axes[i],linewidth=4,title='Proportional Growth of '+key+' Component Spending over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_exp_dict[key])+1)
    i+=1
fig,axes=plt.subplots(5)
i=0
for key in c_exp_dict.keys():
    #Subset to columns of interest
    cols=c_exp_dict[key]
    sub_df=cgg_yr.sum().reindex(columns=cols)
    DataFrame((sub_df/sub_df.ix[1975])*100).plot(kind='line',ax=axes[i],linewidth=5,title='Indexed Absolute Growth of '+key+' Spending over Time')
    axes[i].legend(bbox_to_anchor=(0.03,-.33,1.,.0),loc=3,ncol=len(c_exp_dict[key])+1)
    i+=1
Perhaps the most interesting part of the expenditure side is its categorically different growth behavior relative to revenues. Expenditures are increasing without regard to revenue behavior. Health spending stands out as a growing portion of the budget, while social services have scaled back. None of this is really surprising. In fact, that's the point. Budget pressures will exist whether or not the levy is constrained.
Ok, I want to know what's happening below the aggregated surface. Are individual jurisdictions following the broader trends? To compare total growth across jurisdictions and with the whole, it will be useful to generate a set of per capita fiscal measures. In deflating the values for charting, we also deflated the population, which is clearly ridiculous. Consequently, we need to reintegrate the old population figure.
#Integrate old population figure into real value DF
cgg_real['POP']=c_gg.ix[cgg_real.index]['POPULATION'].values
#Restrict columns
cols=['REV_TOTAL', 'REV_TOTAL_TAX', 'REV_PROPERTY_TAX', 'REV_SO_TAX', 'REV_SALES_USE_TAX', 'REV_OCCUPATION_TAX', 'REV_FRANCHISE_TAX', 'REV_OTHER_TAX', \
'REV_LODGING_TAX', 'REV_REAL_ESTATE_TRANSFER_TAX', 'REV_UNCLASS_TAX', 'REV_LICENSES', 'REV_CHARGES', 'REV_FINES', 'REV_TRANSFER_IN', 'REV_INTGOVT', \
'REV_HUT', 'REV_CIGARETTE_TAX', 'REV_MOTOR_VEH_FEE', 'REV_CTF', 'REV_SOCIAL_SERVICE', 'ALL_OTHER_INTGOVT', 'REV_MISC', 'REV_INTEREST', 'EXP_TOTAL', \
'EXP_TOTAL_OPERATING', 'EXP_GEN_GOVT', 'EXP_JUDICIAL', 'EXP_TOTAL_PUBLIC_SAFETY', 'EXP_POLICE', 'EXP_FIRE', 'EXP_OTHER_PUBLIC_SAFETY', 'EXP_TOTAL_PUBLIC_WORKS', \
'EXP_STREET', 'EXP_TRASH', 'EXP_OTHER_PUBLIC_WORKS', 'EXP_HEALTH', 'EXP_RECREATION', 'EXP_SOCIAL_SERVICE', 'EXP_MISC', 'EXP_TRANSFER_OUT', 'EXP_CAPITAL_OUTLAY', \
'EXP_DEBT_SERVICE_GEN', 'EXP_PRINCIPAL_GEN', 'EXP_INTEREST_GEN', 'GO_DEBT_GEN', 'REVENUE_DEBT_GEN', 'OTHER_DEBT_GEN', 'ASSETS', 'LIABILITIES', \
'RETAIL_SALES', 'ST_SALES_TAX_PAID', 'TOTAL_DEBT', 'POP']
cgg_real2=cgg_real.reindex(columns=cols)
#Calculate per capita set
cgg_rpc=cgg_real.div(cgg_real['POP'],axis=0)
Now we can take a look...
#Group by year
cgg_rpcg=cgg_rpc.groupby(level='AUDIT_YEAR')
#Plot median and average per capita revenues by year
plt.rcParams['figure.figsize']=20,10
#fig,axes=plt.subplots(1)
#Populate subplot #1
cgg_rpcg['REV_TOTAL'].mean().plot(kind='line',linewidth=3,title='Average Per Capita Revenues from All Counties in Colorado',label='AVG')
#Populate subplot #2
cgg_rpcg['REV_TOTAL'].median().plot(kind='line',linewidth=3,linestyle='--',title='Median Per Capita Revenues from All Counties in Colorado',label='MED')
"""
#Generate list of jurisdictions
county_list=[]
for county in cgg_rpc.index.get_level_values(level='NAME'):
    county_list.append(county)
print set(county_list)
"""
unsorted_counties=['Gilpin County', 'Las Animas County', 'Morgan County', 'Alamosa County', 'Dolores County', 'Summit County', 'Jefferson County', \
'Pueblo County', 'Jackson County', 'Clear Creek County', 'Bent County', 'Teller County', 'Rio Grande County', 'Montrose County', \
'Adams County', 'Moffat County', 'San Miguel County', 'Larimer County', 'Mesa County', 'Archuleta County', 'Sedgwick County', \
'Arapahoe County', 'Hinsdale County', 'Lincoln County', 'Denver, City And County of', 'Mineral County', 'Weld County', 'Huerfano County', \
'Montezuma County', 'Routt County', 'Garfield County', 'Custer County', 'Saguache County', 'Ouray County', 'La Plata County', \
'Otero County', 'Boulder County', 'Baca County', 'Conejos County', 'Lake County', 'Washington County', 'Eagle County', 'Douglas County', \
'Logan County', 'Costilla County', 'Kiowa County', 'Yuma County', 'Chaffee County', 'Delta County', 'Phillips County', 'El Paso County', \
'Fremont County', 'Rio Blanco County', 'San Juan County', 'Gunnison County', 'Elbert County', 'Broomfield, City and County of', \
'Cheyenne County', 'Park County', 'Kit Carson County', 'Prowers County', 'Pitkin County', 'Crowley County', 'Grand County']
#Plot average and median per capita revenues over time with individual counties as context
plt.rcParams['figure.figsize']=20,20
fig,axes=plt.subplots(2)
#Populate subplot #1
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['REV_TOTAL'].plot(kind='line',ax=axes[0],linewidth=1,label=county,alpha=.5)
cgg_rpcg['REV_TOTAL'].mean().plot(kind='line',ax=axes[0],linewidth=10,linestyle='--',title='Average vs Real Per Capita Revenues from All Counties in Colorado',label='AVG')
#Populate subplot #2
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['REV_TOTAL'].plot(kind='line',ax=axes[1],linewidth=1,label=county,alpha=.5)
cgg_rpcg['REV_TOTAL'].median().plot(kind='line',ax=axes[1],linewidth=10,linestyle='--',title='Median vs Real Per Capita Revenues from All Counties in Colorado',label='MED')
#Generate a box plot of REV_TOTAL
plt.rcParams['figure.figsize']=20,10
cgg_rpc.reset_index('AUDIT_YEAR').boxplot(column='REV_TOTAL',by='AUDIT_YEAR')
Well, the distribution certainly looks a lot more consistent in per capita terms. In fact, it looks like median and average per capita revenues are very well correlated.
The latter two plots demonstrate the substantial variation that occurs around the measures of central tendency. The volatility is so great that per capita growth looks almost nonexistent!
What about the shape of the distribution?
for i in range(31):
    c_gg_cross=cgg_rpc.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['REV_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Per Capita Revenue Shape Over Time',color=(((31-i)/float(31)),0.,(i/float(31))),label=str((1979+i)))
plt.xlim([-2000,12000])
plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)
And now for a look at expenditures...
#Plot median and average per capita expenditures by year
plt.rcParams['figure.figsize']=20,10
#fig,axes=plt.subplots(1)
#Populate subplot #1
cgg_rpcg['EXP_TOTAL'].mean().plot(kind='line',linewidth=3,title='Average Per Capita Expenditures from All Counties in Colorado',label='AVG')
#Populate subplot #2
cgg_rpcg['EXP_TOTAL'].median().plot(kind='line',linewidth=3,linestyle='--',title='Median Per Capita Expenditures from All Counties in Colorado',label='MED')
#Plot average and median per capita expenditures over time with individual counties as context
plt.rcParams['figure.figsize']=20,20
fig,axes=plt.subplots(2)
#Populate subplot #1
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['EXP_TOTAL'].plot(kind='line',ax=axes[0],linewidth=1,label=county,alpha=.5)
cgg_rpcg['EXP_TOTAL'].mean().plot(kind='line',ax=axes[0],linewidth=10,linestyle='--',title='Average vs Real Per Capita Spending from All Counties in Colorado',label='AVG')
#Populate subplot #2
for county in unsorted_counties:
    cgg_rpc.xs(county,level='NAME').reset_index('LGTYPE_ID')['EXP_TOTAL'].plot(kind='line',ax=axes[1],linewidth=1,label=county,alpha=.5)
cgg_rpcg['EXP_TOTAL'].median().plot(kind='line',ax=axes[1],linewidth=10,linestyle='--',title='Median vs Real Per Capita Spending from All Counties in Colorado',label='MED')
#Generate a box plot of EXP_TOTAL
plt.rcParams['figure.figsize']=20,10
cgg_rpc.reset_index('AUDIT_YEAR').boxplot(column='EXP_TOTAL',by='AUDIT_YEAR')
To my mind, per capita expenditures look a bit more volatile than per capita revenues. Is that actually the case? Coefficient of variation might provide a reasonable view.
DataFrame(cgg_rpcg.std()/cgg_rpcg.mean())[['REV_TOTAL','EXP_TOTAL']].plot(kind='bar',title='CoV for Per Capita Revenues & Expenditures')
And that's why you check these things! Evidently they are quite similarly volatile.
What about the expenditure distribution over time?
for i in range(31):
    c_gg_cross=cgg_rpc.xs((1979+i),level='AUDIT_YEAR')
    c_gg_cross['EXP_TOTAL'].plot(kind='kde',linewidth=3,title='Evolution of County Per Capita Expenditure Shape Over Time',color=(((31-i)/float(31)),0.,(i/float(31))),label=str((1979+i)))
plt.xlim([-2000,12000])
plt.legend(bbox_to_anchor=(0.0,-.23,1.,.0),loc=3,ncol=13)
Ultimately, my goal is to be able to characterize sub-state jurisdictions. I think the views above give enough of a wide angle perspective, but it doesn't add much value to repeat the compositional views unless I can draw distinctions across types of jurisdictions within a class. Currently, I am looking at counties, but any of the classification methods can be extended to local governments as well.
The first thing that comes to mind is scale effects. Before integrating TEL effects, it's important to understand whether or not simply having more people influences the efficiency of service delivery as measured by revenues or expenditures per capita. It would then be useful to see if growth trajectories look different according to scale. Exploratory analysis will utilize basic OLS, and then cursory panel analysis.
To start, pooled OLS will suffice. For the sake of procedural familiarity, we will leave only time in the index. However, we must first reintegrate population with the per capita set (it was reduced to 1 when calculating per capita values).
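(As an aside: pd.ols has since been removed from pandas. For reference, a comparable pooled regression with an intercept can be sketched directly with numpy least squares; the data below are synthetic placeholders, not the actual county columns.)

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic pooled data: 200 observations, two regressors, known coefficients
X = rng.rand(200, 2)
y = 1.0 + X.dot(np.array([0.3, -0.2])) + rng.randn(200) * 0.1

# Append an intercept column and solve min ||Xd*b - y||^2
Xd = np.column_stack([X, np.ones(len(X))])
beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
# beta recovers roughly [0.3, -0.2, 1.0] (slope1, slope2, intercept)
```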
#Reintegrate population
cgg_rpc['POP2']=c_gg.ix[cgg_rpc.index]['POPULATION'].values
#Limit available columns
ols_cols=[x for x in cgg_rpc.columns if x not in ['POPULATION','DFL','POP','SALES_TAX_RATE']]
#Isolate time in index
cgg_ols=cgg_rpc.reset_index(['NAME','LGTYPE_ID'])[ols_cols]
#Create column slice lists
ols_y1='REV_TOTAL'
ols_y2='EXP_TOTAL'
ols_x=['POP2','RETAIL_SALES']
#Run OLS
rev_ols=pd.ols(y=cgg_ols[ols_y1],x=cgg_ols[ols_x])
exp_ols=pd.ols(y=cgg_ols[ols_y2],x=cgg_ols[ols_x])
print rev_ols
print exp_ols
#Capture impacts for display
ols_res=[rev_ols.beta[:2],exp_ols.beta[:2]]
ols_results=pd.concat(ols_res,axis=1)
ols_results.columns=['rev','exp']
print ols_results
#Plot results
plt.rcParams['figure.figsize']=10,5
ols_results.plot(kind='bar',title='OLS Coefficient (Population, Sales Volume Per Capita) Values')
As can be seen, the explanatory power of scale (as measured by population and retail sales volume per capita) is absolutely abysmal. This actually bodes well, because it leaves more explanatory room for tax policy. Here's to keeping our fingers crossed!
In the meantime, let's see if a panel run (with jurisdiction in the index) makes a difference (still pooled OLS).
#Isolate time and jurisdiction in index
cggpc_panel=cgg_rpc.reset_index('LGTYPE_ID')[ols_cols]
#Run pooled OLS
rev_pool=pd.ols(y=cggpc_panel[ols_y1],x=cggpc_panel[ols_x])
exp_pool=pd.ols(y=cggpc_panel[ols_y2],x=cggpc_panel[ols_x])
print rev_pool
print exp_pool
#Capture impacts for display
pool_res=[rev_pool.beta[:2],exp_pool.beta[:2]]
pool_results=pd.concat(pool_res,axis=1)
pool_results.columns=['rev','exp']
print pool_results
#Plot results
plt.rcParams['figure.figsize']=10,5
pool_results.plot(kind='bar',title='Pooled OLS Coefficient (Population, Sales Volume Per Capita) Values')
Hmmm, pooled OLS should have returned the same thing... right? Is it not just treating the set as a large cross-section? I need to explore this further. The increase in explanatory power seems to suggest that it was more an average of within group effects rather than an average effect over all observations...
I don't know, I have been staring at this screen almost constantly for about 11 hours, so maybe I am just having a giant brain fart...